In this exercise, we will require the tidyverse, gt and janitor packages.

library(tidyverse)
library(gt)
library(janitor)

The goal of this exercise is to produce some data summaries and get some experience using pipes (%>%). The Metropolitan Museum of Art in New York City maintains a database of more than 470,000 artworks. For the purposes of this exercise, we are going to focus on a small number of European paintings.
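As a quick reminder of what the pipe does, the sketch below compares a nested call with the equivalent piped call. The vector ages holds made-up values for illustration only; it is not taken from the museum data.

```r
library(dplyr)  # provides %>% (it is also loaded by library(tidyverse))

# Made-up values, for illustration only
ages <- c(129, 153, 480)

# Nested call: read from the inside out
nested <- round(mean(ages), 1)

# Piped call: read from left to right
piped <- ages %>% mean() %>% round(1)

identical(nested, piped)
```

The pipe passes the result on its left as the first argument of the function on its right, so both versions compute the same value.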

(a) Read in the file containing European paintings and take a look at it in RStudio.

The file is called MetEuro.csv and you can read it in using the read_csv function. Name the resulting data frame met_euro if you want to be consistent with the solutions we provide.

Once you have read the file in as a data frame, you will find it in the Environment tab in RStudio and can explore it more there.

met_euro <- read_csv("MetEuro.csv")
Rows: 15 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Department, Object_Title, Artist_Name, Artist_Nationality, Medium
dbl (2): Artist_Birth_Year, Object_Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
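The message above suggests setting show_col_types = FALSE to quiet the column specification. A minimal sketch of that, followed by a quick look at the result with glimpse, is given below. So that it runs without MetEuro.csv, it first writes a tiny made-up file; the name toy_file and its two rows are assumptions for illustration only.

```r
library(readr)
library(dplyr)

# Write a tiny made-up file so the example runs without MetEuro.csv
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Medium,Object_Age", "Pastel,100", "Ivory,200"), toy_file)

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_file, show_col_types = FALSE)

# Compact overview: one line per column, with its type and first values
glimpse(toy)
```

The same two steps should work on the real file by replacing toy_file with "MetEuro.csv".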

(b) Consider what each of the variables might refer to. How many European paintings are included in this data set?

There are 15 rows, and therefore 15 artworks, in the file. Note that the age (Object_Age) is given in years.

(c) Produce a frequency table for the Artist nationality. Which one is the most common?

Use tabyl to look at this variable. Recall that gt() produces nicely formatted tables.

met_euro %>% 
  tabyl(Artist_Nationality) %>%
  gt()
Artist_Nationality  n  percent
British             2  0.13333333
Dutch               4  0.26666667
French              6  0.40000000
German              1  0.06666667
Netherlandish       1  0.06666667
Swedish             1  0.06666667

The frequency table shows that French is the most common artist nationality, accounting for 6 of the 15 paintings (40%).

(d) Produce a frequency table for Medium and try adding percentages. Which Medium is the most common?

You can use janitor's adorn_ functions (here adorn_totals and adorn_pct_formatting) to add a totals row and to display the proportions as percentages.

met_euro %>%
  tabyl(Medium) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting() %>%
  gt()
Medium         n   percent
Ivory          2   13.3%
Oil on canvas  8   53.3%
Oil on wood    1   6.7%
Pastel         2   13.3%
Vellum         2   13.3%
Total          15  100.0%

Oil on canvas is by far the most common medium, used for 8 of the 15 paintings (53.3%).

Later in this course we will learn how to sort rows in a table by, say, the frequency.
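As a preview, one way to sort a frequency table is to pass the tabyl output to arrange from dplyr. This is only a sketch; the course may introduce sorting differently. The data frame toy is made up for illustration and is not the contents of MetEuro.csv.

```r
library(dplyr)
library(janitor)

# Made-up data for illustration only; not the MetEuro.csv contents
toy <- tibble(Medium = c("Pastel", "Oil on canvas", "Oil on canvas",
                         "Ivory", "Oil on canvas", "Pastel"))

sorted <- toy %>%
  tabyl(Medium) %>%
  arrange(desc(n))  # most frequent medium first

sorted
```

If you also want a totals row, call adorn_totals("row") after arrange so that the total is not sorted along with the other rows.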

(e) Summarise the Object Age variable. What is the average age of the items in this collection? What is the age of the newest painting? And the oldest painting?

You can use the mean, min and max functions within summarise to obtain these statistics.

met_euro %>% 
  summarise(Variable = "Age of Object",
            Mean = mean(Object_Age), 
            Max = max(Object_Age),
            Min = min(Object_Age)) %>%
  gt()
Variable       Mean      Max  Min
Age of Object  218.3333  480  129

The average age of the paintings is about 218 years, with the newest painting 129 years old and the oldest 480 years old.
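Other summary functions can be used inside summarise in the same way. The sketch below adds a count with n(), rounds the mean with round, and includes the median. The ages in toy are made up for illustration only.

```r
library(dplyr)

# Made-up ages for illustration only
toy <- tibble(Object_Age = c(100, 150, 225))

toy_summary <- toy %>%
  summarise(N = n(),                            # number of rows
            Mean = round(mean(Object_Age), 1),  # mean, to 1 decimal place
            Median = median(Object_Age))

toy_summary
```

Rounding inside summarise keeps the printed table tidy; piping the result to gt() would then format it as before.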

(f) Summarise the Object Age variable again, but this time grouping by Medium. What is the medium of the oldest painting?

Here we introduce group_by to our code above to look at the descriptive statistics by medium.

met_euro %>% 
  group_by(Medium) %>%
  summarise(Variable = "Age of Object",
            Mean = mean(Object_Age), 
            Max = max(Object_Age),
            Min = min(Object_Age)) %>%
  gt()
Medium         Variable       Mean    Max  Min
Ivory          Age of Object  240.00  245  235
Oil on canvas  Age of Object  185.75  350  129
Oil on wood    Age of Object  153.00  153  153
Pastel         Age of Object  133.50  137  130
Vellum         Age of Object  444.50  480  409

The oldest painting, at 480 years old, was painted on vellum.
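An alternative way to answer this question is to pull out the row with the largest age directly, using slice_max from dplyr (available in dplyr 1.0.0 and later). This is a sketch on made-up data; the titles and ages in toy are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only; titles and ages are hypothetical
toy <- tibble(Object_Title = c("Painting A", "Painting B", "Painting C"),
              Medium = c("Pastel", "Vellum", "Ivory"),
              Object_Age = c(130, 400, 240))

# Keep only the row with the largest Object_Age
oldest <- toy %>%
  slice_max(Object_Age, n = 1)

oldest
```

Because the whole row is returned, the title and medium of the oldest item can be read off together with its age.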

(g) Extension exercises

Propose some other summaries that might be of interest and provide the code to produce them.
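One possible extra summary is a two-way frequency table of nationality by medium, which tabyl produces when given two variables. The sketch below uses a made-up data frame toy, so the counts are for illustration only.

```r
library(dplyr)
library(janitor)

# Made-up data for illustration only; not the MetEuro.csv contents
toy <- tibble(Artist_Nationality = c("French", "French", "Dutch", "French"),
              Medium = c("Oil on canvas", "Oil on canvas",
                         "Oil on wood", "Pastel"))

# Rows are nationalities, columns are media, cells are counts
crosstab <- toy %>%
  tabyl(Artist_Nationality, Medium)

crosstab
```

Applied to met_euro, the same call should show how the media are distributed across nationalities.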

Download the dental decay file used in lectures and attempt to produce your own summaries.


© 2022 Statistical Consulting Centre, The University of Melbourne.